14 research outputs found

FIN-CLARIN - a humanities research infrastructure with emphasis on language

Author: Linden Bo Krister Johan
Publication date: 01/07/2014

Billions of words and thousands of hours of audio and video are needed as material for humanities research, and for language research in particular. Researchers also need tools to refine their own data collections and compare them with general data collections. When a research project ends, storage and distribution sites are needed to make raw data, tools, and research results available and usable. Data, tools, and shared means of use together form a research infrastructure. It makes it possible to verify earlier results and to reach new findings more efficiently, because not everyone has to start from scratch collecting data and building analysis tools.

Non peer reviewed

Helsingin yliopiston digitaalinen arkisto

OCR and post-correction of historical Finnish texts

Author: Drobac Senka
Kauppinen Pekka Sakari
Linden Bo Krister Johan
Publication venue: 'Linkoping University Electronic Press'
Publication date: 01/01/2017

This paper presents experiments on optical character recognition (OCR) that combine the Ocropy software with data-driven spelling correction based on weighted finite-state methods. Both model training and testing were done on Finnish corpora of historical newspaper text, and the best combination of OCR and post-processing models gives 95.21% character recognition accuracy.

Peer reviewed

Helsingin yliopiston digitaalinen arkisto
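
The OCR record above reports a character recognition accuracy figure. The abstract does not say how that figure was computed, so the following is only a minimal sketch of one common definition, assuming accuracy is one minus the Levenshtein edit distance divided by the reference length. The function names `edit_distance` and `char_accuracy` are hypothetical and do not come from Ocropy or from the paper.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cur.append(min(
                prev[j] + 1,                            # delete ref_char
                cur[j - 1] + 1,                         # insert hyp_char
                prev[j - 1] + (ref_char != hyp_char),   # substitute or match
            ))
        prev = cur
    return prev[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - edit_distance / length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)


# Toy example (invented strings, not data from the paper):
# one wrong character in an 11-character word.
toy_accuracy = char_accuracy("sanomalehti", "sanomalebti")
```

Under this definition a post-correction step improves the score whenever it lowers the edit distance between the OCR output and the ground-truth transcription.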

Morpheme Segmentation Gold Standards for Finnish and English

Author: Creutz Mathias Johan Philip
Linden Bo Krister Johan
Publication venue: Helsinki University of Technology
Publication date: 01/01/2004

This document describes Hutmegs, the Helsinki University of Technology Morphological Evaluation Gold Standard package, which contains gold-standard morphological segmentations for 1.4 million Finnish and 120,000 English words. The gold standards comprise surface-string (allomorph) segmentations of word forms as well as deep-level (morpheme) segmentations of the words.

Non peer reviewed

Helsingin yliopiston digitaalinen arkisto

Evaluating HeLI with non-linear mappings

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2017

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Evaluation of language identification methods using 285 languages

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: 'Linkoping University Electronic Press'
Publication date: 01/01/2017

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 30/04/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Language and Dialect Identification of Cuneiform Texts

Author: Alstola Tero
Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 30/04/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Iterative Language Model Adaptation for Indo-Aryan Language Identification

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 01/08/2018

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Semantic Domains in Akkadian Text

Author: Jauhiainen Heidi Annika
Linden Bo Krister Johan
Sahala Antti Juha Aleksi
Svärd Saana Sofia
Publication venue: 'Brill'
Publication date: 07/08/2018

The article examines the possibilities that language technology offers for analyzing semantic fields in Akkadian. Our research group's data come from an existing electronic corpus, the Open Richly Annotated Cuneiform Corpus (ORACC). In addition to more traditional Assyriological methods, the article explores two language-technology methods: pointwise mutual information (PMI) and Word2vec.

Peer reviewed

Helsingin yliopiston digitaalinen arkisto
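
The Akkadian record above names pointwise mutual information (PMI) as one of its methods. As a reminder of what the measure computes, PMI(a, b) = log2( p(a, b) / (p(a) p(b)) ), where p(a, b) is the probability of seeing the two words together and p(a), p(b) are their individual probabilities. The sketch below is a minimal illustration under stated assumptions: co-occurrence means appearing within a fixed token window, probabilities are plain relative frequencies, and the function name `pmi_scores` and the toy tokens are hypothetical. It is not the article's pipeline and uses no ORACC data.

```python
import math
from collections import Counter


def pmi_scores(sentences, window=2):
    """Return base-2 PMI for each unordered pair of distinct words
    that co-occur within `window` following tokens."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, left in enumerate(tokens):
            # Pair each token with the next `window` tokens.
            for right in tokens[i + 1:i + 1 + window]:
                if left != right:
                    pair_counts[tuple(sorted((left, right)))] += 1

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total_pairs               # joint relative frequency
        p_a = word_counts[a] / total_words   # marginal relative frequencies
        p_b = word_counts[b] / total_words
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy corpus with invented English placeholder tokens.
toy_sentences = [["king", "palace"], ["king", "palace"], ["field", "grain"]]
toy_scores = pmi_scores(toy_sentences, window=1)
```

A positive score means a pair co-occurs more often than its individual frequencies would predict. Raw PMI tends to favor rare pairs, so in practice it is often combined with frequency thresholds or smoothing.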

HeLI, a Word-Based Backoff Method for Language Identification

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication date: 01/01/2016

In this paper we describe the Helsinki language identification method, HeLI, and the resources we created for and used in the 3rd edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2016 workshop. The shared task comprised a total of 8 tracks, of which we participated in 7. The shared task had a record number of participants, with 17 teams providing results for the closed track of test set A. Our system reached 2nd position in 4 tracks (A closed and open, B1 open and B2 open), and in this paper we focus on the methods and data used for those tracks. We describe our word-based back-off method in mathematical notation. We also describe how we selected the corpus we used in the open tracks.

Peer reviewed

Helsingin yliopiston digitaalinen arkisto
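
The HeLI record above mentions a word-based back-off method for language identification. The abstract gives no formulas, so the following is a deliberately simplified sketch of the general idea of backing off from whole words to character n-grams. It is not the HeLI implementation. Everything in it is an assumption made for illustration: the single back-off level, the trigram length, the `PENALTY` value, the negative log10 scoring, and the names `train` and `identify`.

```python
import math
from collections import Counter

PENALTY = 7.0  # assumed cost for a unit a language model has never seen


def _neg_log_freqs(counts):
    """Turn raw counts into -log10 relative frequencies (lower is better)."""
    total = sum(counts.values())
    return {unit: -math.log10(c / total) for unit, c in counts.items()}


def _ngrams(word, n):
    padded = f" {word} "  # spaces mark the word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(texts_by_lang, n=3):
    """Build a (word model, character n-gram model) pair per language."""
    models = {}
    for lang, text in texts_by_lang.items():
        words = text.lower().split()
        grams = Counter(g for w in words for g in _ngrams(w, n))
        models[lang] = (_neg_log_freqs(Counter(words)), _neg_log_freqs(grams))
    return models


def identify(text, models, n=3):
    """Return the language with the lowest average per-word score."""
    words = text.lower().split()
    if not words:
        raise ValueError("text contains no words")
    totals = dict.fromkeys(models, 0.0)
    for word in words:
        known = any(word in word_model for word_model, _ in models.values())
        for lang, (word_model, gram_model) in models.items():
            if known:
                # Word level: some language knows this word.
                totals[lang] += word_model.get(word, PENALTY)
            else:
                # Back off: average the character n-gram scores instead.
                grams = _ngrams(word, n)
                totals[lang] += sum(
                    gram_model.get(g, PENALTY) for g in grams
                ) / len(grams)
    return min(totals, key=lambda lang: totals[lang] / len(words))


# Toy training texts (invented, far too small for real use).
toy_models = train({
    "en": "the cat sat on the mat",
    "fi": "kissa istui matolla ja kissa nukkui",
})
```

With these toy models, `identify("kissat", toy_models)` has no word-level match in either language, so the decision falls to the character trigrams that the unseen word shares with the training text.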